Water is life. This is especially true in places where it is scarce, as in large parts of Africa. To provide people with fresh water, organizations build water pumps, but they often do not pay for further maintenance, so the pumps break down and become useless. The online platform Taarifa collects data on water pumps in Tanzania and wants to predict which ones are broken or will soon break down, so that maintenance can be organized. The data science competition platform drivendata.com hosts a challenge in which the community can help with the prediction effort (http://www.drivendata.org/competitions/7/page/23/). The data used in this report corresponds to the training data provided for the challenge.
As a first step the variables given in the data set will be inspected:
## [1] "amount_tsh" "basin"
## [3] "construction_year" "date_recorded"
## [5] "district_code" "extraction_type"
## [7] "extraction_type_class" "extraction_type_group"
## [9] "funder" "gps_height"
## [11] "id" "installer"
## [13] "latitude" "lga"
## [15] "longitude" "management"
## [17] "management_group" "num_private"
## [19] "payment" "payment_type"
## [21] "permit" "population"
## [23] "public_meeting" "quality_group"
## [25] "quantity" "quantity_group"
## [27] "recorded_by" "region"
## [29] "region_code" "scheme_management"
## [31] "scheme_name" "source"
## [33] "source_class" "source_type"
## [35] "status_group" "subvillage"
## [37] "ward" "water_quality"
## [39] "waterpoint_type" "waterpoint_type_group"
## [41] "wpt_name"
The list of variables is quite long. To structure the analysis a bit, they were sorted into classes:
## Lake Victoria Pangani Rufiji
## 10248 8940 7976
## Internal Lake Tanganyika Wami / Ruvu
## 7785 6432 5987
## Lake Nyasa Ruvuma / Southern Coast Lake Rukwa
## 5085 4493 2454
The basin variable gives the geographical location of the water point and in some cases probably also the source of the water, for example Lake Victoria. With nine levels, the danger of overfitting should be minimal. And since basins are defined by natural water sources, they might directly influence the water points’ functionality.
## Max. 3rd Qu. Mean Median 1st Qu. Min.
## 2770.0 1319.0 668.3 369.0 0.0 -90.0
The altitude at which the water point is situated might influence the water point’s functionality, e.g. through available water sources or climate. The span of heights is quite big, ranging from -90 m to 2770 m. The large number of water points at 0 m is probably due to missing values (NAs) encoded as zero, but in theory these could also be real heights.
## Iringa Shinyanga Mbeya Kilimanjaro Morogoro Arusha
## 5294 4982 4639 4379 4006 3350
## Kagera Mwanza Kigoma Ruvuma
## 3316 3102 2816 2640
The regions are the coarsest administrative unit in Tanzania. Since regions are political entities, differences in politics between regions might also influence the functionality of water points, e.g. through subsidies.
## Max. 3rd Qu. Mean Median 1st Qu. Min.
## 99.0 17.0 15.3 12.0 5.0 1.0
The region code should just be a coded version of the region variable, which would make it redundant. Since regions are categorical rather than continuous, the non-coded version of the variable might be better suited. There are more unique region codes (27) than region names (21) in the data set, so there might be faulty data.
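One way to check whether region_code is really just a coded version of region is to collect, for every region name, the set of codes it occurs with. The following is a minimal Python sketch under stated assumptions: the rows are plain dictionaries keyed by the column names region and region_code, the helper names are hypothetical, and the sample rows are made up for illustration rather than taken from the data set.

```python
from collections import defaultdict


def codes_per_region(rows):
    """Map each region name to the set of region codes that occur with it."""
    mapping = defaultdict(set)
    for row in rows:
        mapping[row["region"]].add(row["region_code"])
    return dict(mapping)


def ambiguous_regions(rows):
    """Return only the regions that are paired with more than one code."""
    return {
        region: sorted(codes)
        for region, codes in codes_per_region(rows).items()
        if len(codes) > 1
    }


# Made-up toy rows (assumption), not taken from the real data set.
sample = [
    {"region": "Region A", "region_code": 1},
    {"region": "Region A", "region_code": 1},
    {"region": "Region B", "region_code": 2},
    {"region": "Region B", "region_code": 20},
]

print(ambiguous_regions(sample))
```

If the mapping were one-to-one, the result would be empty. Any region listed here would be a candidate for the faulty data suspected above.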
## Max. Mean 3rd Qu. Median 1st Qu. Min.
## 80.00 5.63 5.00 3.00 2.00 0.00
Districts are the next smaller administrative unit in Tanzania. There are 169 districts in total, but only 20 distinct values appear in the data set. This hints at either a misleading feature label or faulty data.
## (Other) Njombe Arusha Rural Moshi Rural Bariadi
## 3437 2503 1252 1251 1177
## Rungwe Kilosa Kasulu Mbozi Meru
## 1106 1094 1047 1034 1009
The local government authority (lga) is the government at a level below the regions. Most of the time these authorities should overlap with the districts. Thus this variable might be redundant with district_code, but again there is a discrepancy in the number of unique values.
## (Other) Igosi Imalinyi Siha Kati Mdandu Nduruma Kitunda
## 47666 307 252 232 231 217 203
## Mishamo Msindo Chalinze
## 203 201 196
Wards are a still smaller administrative unit consisting of up to 21000 people. The number of levels in this feature is huge, which might interfere with modeling. There seems to be a substantial difference in the number of water points between wards, although wards are divided by population. Thus overuse might occur in some wards.
## (Other) Madukani Shuleni Majengo Kati Mtakuja Sokoni
## 50316 508 506 502 373 371 262 232
## M Muungano
## 187 172
## Isanga Mtaa Wa Kipunguni B Mwangaza
## 34 35 35
## Njia Panda Njiapanda Tankini
## 35 35 35
## Chemchem Kijijini Mchangani
## 36 36 36
## Temeke
## 36
Subvillage is the finest administrative unit and contains even more factor levels. This variable should behave similarly to the ward variable.
## Max. 3rd Qu. Median Mean 1st Qu. Min.
## -2.000e-08 -3.326e+00 -5.022e+00 -5.706e+00 -8.541e+00 -1.165e+01
## Max. 3rd Qu. Median Mean 1st Qu. Min.
## 40.35 37.18 34.91 34.08 33.09 0.00
The coordinates encode the geographic location with the highest resolution in this data set. At the same time, they are continuous rather than discrete. A value of 0 again represents NAs, since neither the prime meridian nor the equator runs through Tanzania; these values were thus not plotted. The peak at a latitude of about -3 is probably there because it includes the southern shore of Lake Victoria, the Serengeti National Park and the Kilimanjaro National Park, which are highly populated and attract many tourists.
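Treating zero coordinates as missing can be sketched as follows. This is a minimal Python illustration, not the code used for the report. The helper name and the tolerance are assumptions. A tolerance is used instead of an exact comparison because the latitude summary above shows a maximum of -2e-08, which suggests that the placeholder is not always exactly zero.

```python
import math


def clean_coordinate(value, tolerance=1e-6):
    """Return None for placeholder coordinates at or extremely close to zero.

    The tolerance is an assumed value; anything closer to zero than this
    is treated as a missing coordinate rather than a real position.
    """
    if value is None or math.isnan(value) or abs(value) < tolerance:
        return None
    return value


# Values taken from the summaries above: two placeholders and one median.
print(clean_coordinate(0.0))      # longitude placeholder
print(clean_coordinate(-2e-08))   # latitude placeholder
print(clean_coordinate(34.91))    # plausible longitude
```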
## functional non functional functional needs repair
## 32259 22824 4317
Slightly more than half of the water points are functional (54.3 %). About 38.4 % are non functional, which is quite a big fraction. A significantly smaller share is still functional but needs repair (7.3 %).
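The shares of the three status groups follow directly from the counts in the table above. As a small worked example in Python (the variable names are chosen for illustration only):

```python
# Counts copied from the status_group table above.
counts = {
    "functional": 32259,
    "non functional": 22824,
    "functional needs repair": 4317,
}

total = sum(counts.values())

# Share of each status group in percent, rounded to one decimal place.
shares = {status: round(100 * n / total, 1) for status, n in counts.items()}

print(total)
print(shares)
```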
## Max. Mean 3rd Qu. Min. 1st Qu. Median
## 350000.0 317.7 20.0 0.0 0.0 0.0
The total static head (tsh) is a rather technical measure: it describes the work a pump must perform to deliver water to the surface. Most values are given as zero. Since this is a technical value, it is probably only measured at water points that are regularly maintained. Thus the high number of zeros, which are probably NAs, might be indicative of bad maintenance. But tsh is also only needed for water points operated by a pumping mechanism. Since only a fraction of water points use pumping mechanisms, some of the NAs can also be accounted for by this fact.
## Max. 3rd Qu. Median Mean Min. 1st Qu.
## 2013 2004 1986 1301 0 0
The number of water points built each year seems to be growing, although it could also be that the oldest ones have already been demolished and were thus no longer considered in the data collection.
## gravity nira/tanira other submersible
## 26780 8154 6430 4764
## swn 80 mono india mark ii afridev
## 3670 2865 2400 1770
## ksb other - rope pump
## 1415 451
## gravity nira/tanira other submersible swn 80
## 26780 8154 6430 6179 3670
## mono india mark ii afridev rope pump other handpump
## 2865 2400 1770 451 364
## gravity handpump other submersible motorpump
## 26780 16456 6430 6179 2987
## rope pump wind-powered
## 451 117
These three variables represent the same data at different levels of detail. In the more detailed variables (extraction_type, extraction_type_group) the granularity does not seem consistent across levels (e.g. india mark ii vs. gravity), which might make them unsuitable for prediction. Using all of these features does not make sense, since none should add significant additional information; the variable with the least detail, which is also the cleanest, might be the best choice. There seem to be only a few motorized water points. Most are operated by natural forces like gravity or wind, followed by manually operated water points.
## soft salty unknown
## 50818 4856 1876
## milky coloured salty abandoned
## 804 490 339
## fluoride fluoride abandoned
## 200 17
## good salty unknown milky colored fluoride
## 50818 5195 1876 804 490 217
For the water quality there is again more than one variable, one of which was cleaned up by merging levels that may have been overly detailed. It seems that most water points actually serve good quality water.
## enough insufficient dry seasonal unknown
## 33186 15129 6246 4050 789
## enough insufficient dry seasonal unknown
## 33186 15129 6246 4050 789
For these two variables the levels are actually identical, thus only quantity has to be considered. The number of water points giving sufficient water is very similar to the number of functional water points. Quite a large number of water points give insufficient water, give water only seasonally or are completely dry.
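Identical level counts alone do not prove that two columns agree row by row, so the redundancy can be checked explicitly. The following Python sketch counts every combination of levels of two columns. It assumes rows stored as dictionaries, the helper names are hypothetical, and the sample rows are invented for illustration.

```python
from collections import Counter


def level_pairs(rows, first, second):
    """Count how often each (first, second) combination of levels occurs."""
    return Counter((row[first], row[second]) for row in rows)


def is_redundant(rows, first, second):
    """True if every row carries the same level in both columns."""
    return all(a == b for a, b in level_pairs(rows, first, second))


# Invented toy rows (assumption), using level names from the tables above.
sample = [
    {"quantity": "enough", "quantity_group": "enough"},
    {"quantity": "dry", "quantity_group": "dry"},
    {"quantity": "seasonal", "quantity_group": "seasonal"},
]

print(is_redundant(sample, "quantity", "quantity_group"))
```

The same pair counts can also be used for the other groups of related variables, where they show which detailed levels were merged into which coarser level.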
## spring shallow well machine dbh
## 17021 16824 11075
## river rainwater harvesting hand dtw
## 9612 2295 874
## lake dam other
## 765 656 212
## unknown
## 66
## spring shallow well borehole
## 17021 16824 11949
## river/lake rainwater harvesting dam
## 10377 2295 656
## other
## 278
## groundwater surface unknown
## 45794 13328 278
For this variable it might not be the best choice to use the most cleaned up version (source_class), since it only differentiates between groundwater, surface water and unknown sources. The source_type variable, on the other hand, additionally differentiates between subtypes while still being clean. Most water sources are groundwater sources rather than surface water sources, which would be expected in a dry climate.
## communal standpipe hand pump
## 28522 17488
## other communal standpipe multiple
## 6380 6103
## improved spring cattle trough
## 784 116
## dam
## 7
The waterpoint_type variable seems to be slightly redundant with source_type and extraction_type, considering the level names. But it only contains 7 instances of the level dam, while source_type contains 656 instances of dam. There might be faulty data in the data set. Since waterpoint_type contains a considerable number of NAs (other), the discrepancy might also be caused by a lack of data.
## (Other) Government Of Tanzania
## 11757 9084 3635
## Danida Hesawa Rwssp
## 3114 2202 1374
## World Bank Kkkt World Vision
## 1349 1287 1246
## Unicef
## 1057
There are quite a lot of funders for water points, but the government funds by far the most. The sheer number of levels makes the variable useless for direct use in classification, but additional, more useful features might be extracted from it.
## DWE (Other) Government RWE Commu
## 17402 12250 3655 1825 1206 1060
## DANIDA KKKT Hesawa 0
## 1050 898 840 777
The installer variable behaves similarly to the funder variable. Again the government or a government department (DWE) is the biggest installer.
## vwc wug water board wua
## 40507 6515 2933 2535
## private operator parastatal water authority other
## 1971 1768 904 844
## company unknown
## 685 561
## user-group commercial parastatal other unknown
## 52490 3638 1768 943 561
As with some of the variables discussed above, two variables show the same data, one of them cleaned up. Here the less detailed variable seems to be the better choice, since it seems to be more consistent. Most water points actually seem to be managed by the users themselves, which is probably not the most efficient way.
## never pay pay per bucket pay monthly
## 25348 8985 8300
## unknown pay when scheme fails pay annually
## 8157 3914 3642
## other
## 1054
## never pay per bucket monthly unknown on failure annually
## 25348 8985 8300 8157 3914 3642
## other
## 1054
About 43 % of the water points (25348 of 59400) can be used for free. A big part of those might just be rivers and lakes. The rest, apart from entries with unknown payment, is paid for under different schemes, e.g. per use or monthly.
## True False
## 38852 17492 3056
Most water points are permitted. It is unclear, however, whether this means that building the water point was permitted or that a permit is needed to collect water from it.
## Max. 3rd Qu. Mean Median Min. 1st Qu.
## 30500.0 215.0 179.9 25.0 0.0 0.0
The population variable is quite skewed. Most water points are in an area with very low population.
## True False
## 51011 5055 3334
In most places there seem to be public meetings.
## (Other)
## 28166 19572
## K None
## 682 644
## Borehole Chalinze wate
## 546 405
## M DANIDA
## 400 379
## Government Ngana water supplied scheme
## 320 270
## VWC WUG Water authority
## 36793 5206 3877 3153
## WUA Water Board Parastatal Private operator
## 2883 2748 1680 1063
## Company Other
## 1061 766
The scheme_name variable contains a lot of levels, which probably carry only little information. The overall scheme categories are summarized in the scheme_management variable. The most common scheme is the Village Water Committee (VWC), which fits well with the management_group variable, according to which most water points are managed by user groups.
The most important observation from the univariate analysis is probably that the data is in need of some cleaning. Many variables are useless for prediction algorithms, because they have too many unique values, are redundant or are unique to each entry. Of the data acquisition related variables, the id, num_private, recorded_by and wpt_name variables can be removed. Of the geographical location variables, region_code, district_code, ward and subvillage can be removed. The lga variable could be interesting, but it might be slightly redundant with the region variable, so only one of the two should be chosen. Since lga has 125 unique categorical values, which cannot be handled by R implementations of algorithms like random forest, it will be removed as well. The water point properties contain some variables that are apparently intermediates of previous data cleanup. Of each such group only one variable should be used. Some considerations were already described above.
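Whether a categorical variable has too many levels to be used directly can be checked by counting its unique values. The Python sketch below is an illustration under assumptions: rows are dictionaries, only string values are treated as categorical, and the threshold of 50 levels is an arbitrary placeholder rather than the limit of any particular library.

```python
def high_cardinality_columns(rows, max_levels=50):
    """Return {column: number of levels} for string columns above max_levels.

    max_levels is an assumed, adjustable threshold.
    """
    levels = {}
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str):
                levels.setdefault(column, set()).add(value)
    return {
        column: len(values)
        for column, values in levels.items()
        if len(values) > max_levels
    }


# Synthetic rows (assumption): 60 distinct lga labels, 3 distinct basins.
sample = [{"lga": f"lga_{i}", "basin": f"basin_{i % 3}"} for i in range(60)]

print(high_cardinality_columns(sample))
```

Columns that are unique to each entry, such as an identifier stored as text, would be flagged by the same check.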
The same is true for the water point management variables. Additionally, the funder and installer variables contain too many unique values and will be removed. But from those variables new, more usable variables will be created to model the experience of the respective funder or installer. To do so, the number of wells installed or funded by an organization will be used as an experience score.
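The experience score described here amounts to replacing each organization name by the number of rows in which it appears. A minimal Python sketch, assuming rows stored as dictionaries and using invented organization names (the helper name is hypothetical):

```python
from collections import Counter


def experience_scores(rows, column):
    """Replace each organization name by how many rows share that name."""
    counts = Counter(row[column] for row in rows)
    return [counts[row[column]] for row in rows]


# Invented funder names (assumption), not taken from the data set.
sample = [
    {"funder": "Org A"},
    {"funder": "Org A"},
    {"funder": "Org B"},
    {"funder": "Org A"},
]

print(experience_scores(sample, "funder"))
```

One design point to keep in mind: in this simple form, rows with a missing or blank name would all be counted as one large organization unless they are excluded first.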
## Max. Mean 3rd Qu. Median 1st Qu. Min.
## 9084 1942 1374 470 78 1
## 3rd Qu. Max. Mean Median 1st Qu. Min.
## 17400 17400 5351 408 71 1
Both plots look pretty similar. Most water points are funded and installed by organizations with little ‘experience’, but in both cases there is a huge outlier, which would probably be the government as discussed for the original variables.
The variable most interesting as a dependent variable is status_group, which is also the label in the contest the data was taken from, since predicting the functional status of a water point would make it possible to organize maintenance better. But there might be other interesting connections in the data. For example, the water quality could be connected to the quantity of water, or regional differences could influence the different variables.
General note on plots: The categorical variables will be plotted as stacked bar plots, where the color indicates the functionality, the count of water points per level is plotted on the y axis and the variable value on the x axis. Both the absolute and the relative distribution of counts will be plotted. In the plots containing the relative values, a black line indicates the fraction of functional water points in the whole data set. This gives an orientation as to whether water points in certain categories perform worse or better than the overall average. This approach has a caveat: if a category or bin contains comparatively few entries, the fraction of functional water points might not be representative. This happens quite often in this data set, illustrating the inhomogeneity and skewness of the data.
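The numbers behind such a relative plot can be sketched in Python as follows. The function computes the overall fraction of functional water points (the black reference line) and the fraction per level, and it marks levels with few entries, which addresses the caveat mentioned above. The function name, the dictionary layout and the min_count threshold are assumptions made for this illustration; the sample rows are invented.

```python
from collections import Counter


def functional_share_by_level(rows, column, label="status_group", min_count=30):
    """Return (overall functional fraction, per-level summary).

    Assumes a non-empty list of row dictionaries. min_count is an assumed
    threshold below which a level is marked as not reliable.
    """
    totals = Counter()
    functional = Counter()
    for row in rows:
        totals[row[column]] += 1
        if row[label] == "functional":
            functional[row[column]] += 1
    baseline = sum(functional.values()) / sum(totals.values())
    summary = {
        level: {
            "n": n,
            "share": functional[level] / n,
            "reliable": n >= min_count,
        }
        for level, n in totals.items()
    }
    return baseline, summary


# Invented toy rows (assumption).
sample = (
    [{"extraction_type_class": "gravity", "status_group": "functional"}] * 3
    + [{"extraction_type_class": "gravity", "status_group": "non functional"}]
    + [{"extraction_type_class": "other", "status_group": "non functional"}]
)

baseline, summary = functional_share_by_level(
    sample, "extraction_type_class", min_count=2
)
print(baseline)
print(summary)
```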
There is no major change in status group fractions over the recording time. This means that there is no bias due to the recording time and that there were no rare events causing unusual changes, which could lead to outliers negatively influencing prediction efforts. It also means that the date_recorded variable would not add a lot of information to a model and is probably not useful.
While in most regions the fraction of functional water points is close to the overall mean fraction, the regions Lindi and Mtwara and to a lesser extent Mara, Ruvuma and Tabora perform worse. On the other hand, Arusha, Iringa, Kilimanjaro and Manyara have an above average share of functioning water points. This indicates that there could be political or geographical factors influencing the status of water points.
Like the region variable, the basin variable contains values performing better or worse than the overall average. Interestingly, the Lake Rukwa and Ruvuma / Southern Coast basins contain data entries from regions with lower fractions of functional water points. Although the categorization into basins relates the data to a geographical rather than a political context, political influences might still be a major confounder, since the overlap between basins and regions is quite big most of the time.
## basin
## region Internal Lake Nyasa Lake Rukwa Lake Tanganyika
## Arusha 1309 0 0 0
## Dar es Salaam 0 0 0 0
## Dodoma 827 0 0 0
## Iringa 0 1582 0 0
## Kagera 0 0 0 341
## Kigoma 0 0 0 2816
## Kilimanjaro 169 0 0 0
## Lindi 0 0 0 0
## Manyara 1206 0 0 0
## Mara 0 0 0 0
## Mbeya 0 2430 1427 0
## Morogoro 0 0 0 0
## Mtwara 0 0 0 0
## Mwanza 0 0 0 99
## Pwani 0 0 0 0
## Rukwa 0 0 1011 797
## Ruvuma 0 1073 0 0
## Shinyanga 1641 0 0 1072
## Singida 1992 0 1 8
## Tabora 641 0 15 1299
## Tanga 0 0 0 0
## basin
## region Lake Victoria Pangani Rufiji Ruvuma / Southern Coast
## Arusha 32 2009 0 0
## Dar es Salaam 0 0 0 0
## Dodoma 0 0 359 0
## Iringa 0 0 3712 0
## Kagera 2975 0 0 0
## Kigoma 0 0 0 0
## Kilimanjaro 0 4210 0 0
## Lindi 0 0 90 1456
## Manyara 0 288 0 0
## Mara 1969 0 0 0
## Mbeya 0 0 782 0
## Morogoro 0 0 1893 0
## Mtwara 0 0 0 1730
## Mwanza 3003 0 0 0
## Pwani 0 0 784 0
## Rukwa 0 0 0 0
## Ruvuma 0 0 260 1307
## Shinyanga 2269 0 0 0
## Singida 0 0 92 0
## Tabora 0 0 4 0
## Tanga 0 2433 0 0
## basin
## region Wami / Ruvu
## Arusha 0
## Dar es Salaam 805
## Dodoma 1015
## Iringa 0
## Kagera 0
## Kigoma 0
## Kilimanjaro 0
## Lindi 0
## Manyara 89
## Mara 0
## Mbeya 0
## Morogoro 2113
## Mtwara 0
## Mwanza 0
## Pwani 1851
## Rukwa 0
## Ruvuma 0
## Shinyanga 0
## Singida 0
## Tabora 0
## Tanga 114
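How strongly regions and basins overlap can be summarized per region as the share of water points that lie in the region's most common basin. The Python sketch below uses three rows copied from the cross table above; the function name and the nested dictionary layout are assumptions made for this illustration.

```python
def dominant_basin_share(basin_counts):
    """Return (most common basin, its share of the region's water points)."""
    total = sum(basin_counts.values())
    basin, count = max(basin_counts.items(), key=lambda item: item[1])
    return basin, count / total


# Non-zero cells for three regions, copied from the cross table above.
crosstab = {
    "Kigoma": {"Lake Tanganyika": 2816},
    "Mbeya": {"Lake Nyasa": 2430, "Lake Rukwa": 1427, "Rufiji": 782},
    "Shinyanga": {"Internal": 1641, "Lake Tanganyika": 1072, "Lake Victoria": 2269},
}

for region, basin_counts in crosstab.items():
    basin, share = dominant_basin_share(basin_counts)
    print(region, basin, round(share, 3))
```

A share of 1 means the region lies entirely in one basin, so the two variables carry the same information for that region. Lower shares mark regions that are split across several basins.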
The first impression is that the higher the location of a water point, the likelier it is to be functional. This is probably at least in part influenced by the far lower number of water points at great heights and might thus be a sampling effect. On the other hand, there might be confounding factors like lower usage, better climate conditions or outdoor tourism causing an increased share of functional water points.
As for the gps_height variable, there might also be a sampling problem for amount_tsh. In this case, however, it could be turned directly into useful information. The tsh measure is probably not easily obtainable, maybe even only by professionals. Thus a measured tsh might indicate maintenance of the water point at the time of recording, causing the observed high fractions of functioning water points for entries with given amount_tsh values. (amount_tsh values were log10-transformed.)
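Since the logarithm of zero is undefined, a log10 transformation has to deal with the many zero entries first. The following Python sketch shows one possible way, assuming that zeros are treated as missing values as discussed for the univariate analysis. The helper names are hypothetical, and the report itself does not state how the zeros were handled before plotting.

```python
import math


def log10_tsh(value):
    """Return log10 of a positive amount_tsh value, or None if it is missing.

    Zero and negative values are treated as missing (assumption).
    """
    if value is None or value <= 0:
        return None
    return math.log10(value)


def tsh_recorded(value):
    """Indicator for whether a usable amount_tsh value was recorded at all."""
    return log10_tsh(value) is not None


# Values taken from the amount_tsh summary above: minimum, 3rd quartile, maximum.
for raw in (0.0, 20.0, 350000.0):
    print(raw, log10_tsh(raw), tsh_recorded(raw))
```

The indicator makes the idea from the paragraph above explicit: whether a value was measured at all could itself be used as information.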
As one would expect, older water points are more often nonfunctional than newer ones. There seems to be a nice linear relationship between the construction year and the fraction of functional water points. The share of water points that need repair is relatively steady, probably because it is a rather transient state.
This variable, like others, shows nicely that if a value is missing for an entry (here called ‘other’), the water point is very likely nonfunctional. Motor pumps also seem to break more often than other extraction mechanisms. Gravity and hand pumps, which are the most common extraction methods, are also the most reliable ones next to rope pumps.
Fluoridated water, although quite uncommon, seems to correlate with a higher fraction of functional water points. This might again be due to better maintenance if the fluoride is added deliberately, which would probably happen mostly in tourist and wealthier areas where better care is taken of the water points. The high fraction of nonfunctional water points among those with unknown water quality is consistent with this.
This variable correlates very closely with the status_group variable. Since dry water points are part of the definition of nonfunctional water points, it is not surprising that nearly all dry water points are labeled nonfunctional. Water points with no quantity description are mostly nonfunctional as well. Water points giving enough water, which are the majority, are functional more often than average.
Water points using natural water sources (rainwater, river/lake, spring) are more often functional than those using manmade sources (borehole, dam, shallow well). Surprisingly, and in contrast to the previous variables, the ‘unknown’ category is about average in its fraction of functional water points.
The waterpoint_type variable strongly illustrates that as long as a value is given, the probability that the water point is functional is very high. As in the univariate analysis, the frequencies given for dams are inconsistent with the ones given in the source variable. Since there are more entries labeled dam in source, those frequencies might be more representative.
There does not seem to be a very meaningful trend in the data. Organizations funding few water points do not influence functionality in a better or worse way than organizations that fund a much higher number of water points. The outlier represented by government funded water points, however, seems to deviate from the average by showing a lower fraction of functional water points.
As for funders, installer experience does not seem to influence functionality much. There is no consistent trend in the data.
Commercial management systems seem to be slightly better for water point functionality, although probably by an insignificant margin. Again the ‘unknown’ category correlates with a higher fraction of nonfunctional water points.
The data shows that payment actually seems to help keep the water points functional. The largest group of water points is free to use (which is good from an ethical point of view), but the fraction of functional water points among them is below average. Paid water points, especially those with regular payments, tend to be functional more often.
Permitting water points does not seem to have a big influence on functionality. It might have a slight positive effect, but that is probably negligible.
It seems that a population of about 10-100 people around a water point is most favorable in terms of functionality. With no or only a few people around, the interest in maintaining the water point might be too low, while too many people might lead to overuse and a higher maintenance demand. Very high numbers of people living around a water point coincide with apparently high fractions of functional water points. Due to the overall low number of those water points this might be a sampling error. But it might also hint at a confounding factor like a higher interest in maintenance. Alternatively, the water point could be a lake, which because of its size alone might have a lot of people living around it and which, since no mechanism is needed to retrieve water, is more likely to be functional.
Public meetings seem to have a slight positive effect on water point functionality.
State Water Commissions (SWC) do not seem to work very well as a management scheme for water points, but this scheme is apparently also only rarely used. The best working management schemes are Water Boards, Water User Associations (WUA), trusts and private operators. Water points managed under the Village Water Committee (VWC) scheme, the most common one, have an about average fraction of functional water points.
The date_recorded variable worked well as a sanity check that the functionality data does not fluctuate with the recording time, but it probably won’t be useful as a predictor. The location of the water points seems to influence the fraction of functional water points. It is hard to say whether this is due to geographical properties of the respective regions, like their altitude, or to political factors. The analysis clearly showed that a larger fraction of older water points is broken compared to newer ones, which makes construction_year a good predictor. Additionally, throughout most variables, values representing NAs seem to correlate with a high fraction of nonfunctional water points. Thus the use of the number of NAs as a predictor could be considered. The main idea behind using NAs as a predictor is that some metrics like amount_tsh might require expertise or equipment, so if the metric was measured it is also more probable that the water point is maintained by professionals. Other, probably more useful predictors would be extraction_type_class, quality_group, quantity_group, payment_type (may be cleaned further by using the following labels: unknown, source_type, never pay, recurring pay, on demand pay), population, and scheme_management. Public_meeting, management_group, waterpoint_type and amount_tsh might also be useful to some extent. But installer_count, funder_count and permit would probably not greatly influence the prediction.
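The idea of using the number of NAs per entry as a predictor can be sketched in Python as follows. Which values count as missing is an assumption here: the placeholder labels and the list of columns in which zero means missing are based on the observations made in the sections above, and both would need to be checked against the actual data. The function and constant names are hypothetical.

```python
# Assumed placeholder labels for missing categorical values.
MISSING_LABELS = {"", "unknown", "other"}

# Assumed numeric columns in which a zero stands for a missing value.
ZERO_MEANS_MISSING = {"amount_tsh", "construction_year", "gps_height", "longitude"}


def count_missing(row):
    """Count the values of one entry that look like placeholders for NA."""
    missing = 0
    for column, value in row.items():
        if value is None:
            missing += 1
        elif isinstance(value, str):
            if value.strip().lower() in MISSING_LABELS:
                missing += 1
        elif column in ZERO_MEANS_MISSING and value == 0:
            missing += 1
    return missing


# Invented example entry (assumption).
entry = {
    "amount_tsh": 0,
    "gps_height": 1319,
    "construction_year": 0,
    "quantity": "unknown",
    "extraction_type_class": "gravity",
}

print(count_missing(entry))
```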
Plotting every water point onto a map of Tanzania and coloring it by its status does not show very intuitively whether there is a pattern in the water points’ locations that causes differences in functionality:
For better orientation in later plots the mapped water points are colored by their region and basin respectively in the following plots:
Since the high number of water points makes it difficult to find patterns on the map, the water points will be binned by their coordinates and then plotted on the map. On the first map the fraction of functional water points per bin will be plotted. Bins with a value below 0.5 mostly contain nonfunctional water points and are colored red. Blue bins contain mostly functional water points.
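The binning step can be sketched in Python as follows. This is an illustration under assumptions, not the code behind the maps: the bin size is a placeholder, only the status functional counts as functional (so functional needs repair is counted with the rest), and a fraction of exactly 0.5 is colored blue, since only values below 0.5 are described as red. The function names are hypothetical and the sample points are invented.

```python
import math
from collections import defaultdict


def bin_index(longitude, latitude, bin_size=1.0):
    """Assign a coordinate pair to a square bin of bin_size degrees."""
    return (math.floor(longitude / bin_size), math.floor(latitude / bin_size))


def functional_fraction_per_bin(points, bin_size=1.0):
    """Return {bin: fraction of functional water points in that bin}."""
    totals = defaultdict(int)
    functional = defaultdict(int)
    for longitude, latitude, status in points:
        key = bin_index(longitude, latitude, bin_size)
        totals[key] += 1
        if status == "functional":
            functional[key] += 1
    return {key: functional[key] / totals[key] for key in totals}


def bin_colour(fraction):
    """Red for mostly nonfunctional bins (below 0.5), blue otherwise."""
    return "red" if fraction < 0.5 else "blue"


# Invented sample points (assumption): (longitude, latitude, status).
points = [
    (34.9, -5.2, "functional"),
    (34.8, -5.1, "non functional"),
    (34.7, -5.3, "non functional"),
    (39.2, -6.8, "functional"),
]

fractions = functional_fraction_per_bin(points)
print({key: bin_colour(value) for key, value in fractions.items()})
```

As with the bar plots, bins with very few water points give unstable fractions, so the number of points per bin is worth keeping alongside the fraction.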
There are some areas that have a high number of bins that contain more nonfunctional water points than functional. Those overlap especially with the southern coast and the area between the lakes Rukwa and Tanganyika, which corresponds well to regions and basins that were identified before in the bivariate analysis. But there are also a lot of bins with mostly nonfunctional water points up the coast and in parts of the inland and coast of Lake Victoria.
A variable that showed a good correlation with functionality in the bivariate analysis was the construction year. Plotted on the map, this data also shows that there are three areas with mostly older water points. Two of those overlap well with the regions Mtwara, Lindi and Rukwa, which were already identified as regions with many nonfunctional water points. The south of the Singida region also has mostly quite old water points, but only a sparse density of them. There are big areas with unknown construction_year values in the inland. Thus the occurrence of NAs might also be regional.
The median population around water points does not seem to correlate with the location and the functionality in the bins, except for the Rukwa region.